```r
# Reading data directly from a GitHub repository
library(readr)  # provides read_csv()

data_location <- paste0(
  "https://raw.githubusercontent.com/",
  "username/repository/main/data/dataset.csv"
)
my_data <- read_csv(file = data_location)
```

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
The Case for Data Sharing
“If we can get our dataset off our own computer, then we are much of the way there.”
— Alexander (2023), TSwD
The FAIR principles guide data sharing and management:
The easiest way to start sharing data is through GitHub:
Benefits of GitHub for Data
R packages can be used to share datasets with documentation:
Advantages of Data Packages
Examples: `babynames`, `troopdata`

For more formal sharing, deposit your data in a repository:
| Repository | Features |
|---|---|
| Zenodo | Free, operated by CERN, provides DOI |
| OSF (Open Science Framework) | Free, integrates with GitHub |
| Harvard Dataverse | Common for journal publications |
| Australian Data Archive | Australian-specific repository |
Why Use Repositories?
Data Dictionary = List of ingredients
Datasheet = Nutrition label
Datasheets for Datasets (Gebru et al. 2021)
Just as electronics come with datasheets, datasets should come with documentation that enables users to understand what they’re working with.
Personally Identifying Information (PII)
PII enables linking observations to actual people:
Protection methods:
| Aspect | CSV | Parquet |
|---|---|---|
| File size | Larger | Smaller (3-4x) |
| Speed | Slower | Faster |
| Data types | Lost | Preserved |
| Human readable | Yes | No |
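The "Data types" row can be checked directly. The sketch below is illustrative only: the data frame is made up, and the Parquet half assumes the `arrow` package is installed (it is not part of base R), so it is skipped when `arrow` is unavailable.

```r
# A hypothetical data frame with a date and a factor column (illustrative only)
set.seed(853)
survey <- data.frame(
  id = 1:10000,
  interviewed = as.Date("2026-01-01") + sample(0:30, 10000, replace = TRUE),
  party = factor(sample(c("A", "B", "C"), 10000, replace = TRUE)),
  score = rnorm(10000)
)

# Round trip through CSV using base R
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# The date column no longer comes back as a Date
class(survey$interviewed)
class(from_csv$interviewed)

# Round trip through Parquet, only if arrow is available
if (requireNamespace("arrow", quietly = TRUE)) {
  parquet_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(survey, parquet_path)
  from_parquet <- arrow::read_parquet(parquet_path)

  print(class(from_parquet$interviewed))                 # column class after the round trip
  print(file.size(csv_path) / file.size(parquet_path))   # size ratio; depends on the data
}
```

The size ratio you see will vary with the dataset, so treat the "3-4x" figure in the table as a rough guide rather than a guarantee.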
The Problem with Using Training Data
When we evaluate predictions on the same data used to fit the model, the resulting accuracy estimates are optimistically biased: they overstate how well the model will generalise to new data.
Cross-validation addresses this by:
How LOO Works
Regular residuals
LOO residuals
LOO R²
Just as we have R², we can calculate LOO R² using LOO residuals. This gives a more realistic estimate of explained variance for new data.
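To make the difference between regular and LOO residuals concrete, here is a minimal sketch using simulated data and ordinary least squares with `lm()`. This is a simple analogue for illustration, not the approximation used by the `loo` package: the data, the variable names, and the R² formula (one minus the residual sum of squares over the total sum of squares) are all assumptions of this example.

```r
# Simulated data (hypothetical): one predictor, known linear relationship
set.seed(853)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
sim <- data.frame(x = x, y = y)

# Regular residuals: every point is predicted by a model that has already seen it
fit_all <- lm(y ~ x, data = sim)
regular_resid <- residuals(fit_all)

# LOO residuals: refit without observation i, then predict observation i
loo_resid <- numeric(n)
for (i in seq_len(n)) {
  fit_minus_i <- lm(y ~ x, data = sim[-i, ])
  loo_resid[i] <- sim$y[i] - predict(fit_minus_i, newdata = sim[i, ])
}

# R-squared computed from each set of residuals
total_ss <- sum((sim$y - mean(sim$y))^2)
r2 <- 1 - sum(regular_resid^2) / total_ss
loo_r2 <- 1 - sum(loo_resid^2) / total_ss

c(r2 = r2, loo_r2 = loo_r2)
```

For least squares, each LOO residual is at least as large in magnitude as the corresponding regular residual, so the LOO R² computed this way can never exceed the in-sample R².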
loo Package

Output includes:
Interpreting Model Comparison
When LOO is unstable (warning messages), use K-fold:
How K-Fold Works
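A minimal sketch of K-fold cross-validation, written in base R with `lm()` on simulated data. The choice of K = 10, the data, and the use of root mean squared error as the summary are assumptions made for illustration.

```r
# Simulated data (hypothetical)
set.seed(853)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

k <- 10
# Randomly assign each observation to one of k equally sized folds
fold <- sample(rep(seq_len(k), length.out = n))

kfold_pred <- numeric(n)
for (j in seq_len(k)) {
  held_out <- fold == j
  # Fit on the other k - 1 folds
  fit_j <- lm(y ~ x, data = sim[!held_out, ])
  # Predict only the observations in the held-out fold
  kfold_pred[held_out] <- predict(fit_j, newdata = sim[held_out, ])
}

# Out-of-sample root mean squared error, pooled across all folds
kfold_rmse <- sqrt(mean((sim$y - kfold_pred)^2))
kfold_rmse
```

Each observation is predicted exactly once, by a model that never saw it, but only K models are fitted rather than n.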
What Happens with Noise Predictors?
| Metric | Original | With Noise |
|---|---|---|
| R² | 0.21 | 0.22 ↑ |
| LOO R² | 0.20 | 0.19 ↓ |
| Log score | -1872 | -1871 ↑ |
| LOO log score | -1876 | -1880 ↓ |
Adding noise improves in-sample fit but hurts cross-validated performance!
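The pattern in the table can be reproduced in miniature. The sketch below uses simulated data and least squares, not the dataset or models behind the numbers above. It relies on the least-squares identity that a LOO residual equals the ordinary residual divided by one minus the leverage. The helper `loo_r2()` is a hypothetical name defined here.

```r
# Simulated data (hypothetical): y depends weakly on x
set.seed(853)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 0.5 * sim$x + rnorm(n)

# Five predictors of pure noise, unrelated to y by construction
noise <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
names(noise) <- paste0("noise", 1:5)
sim_noise <- cbind(sim, noise)

# LOO R-squared via the least-squares shortcut:
# LOO residual = residual / (1 - leverage)
loo_r2 <- function(fit) {
  y <- model.response(model.frame(fit))
  loo_resid <- residuals(fit) / (1 - hatvalues(fit))
  1 - sum(loo_resid^2) / sum((y - mean(y))^2)
}

fit_orig <- lm(y ~ x, data = sim)
fit_noise <- lm(y ~ ., data = sim_noise)

data.frame(
  model = c("original", "with noise"),
  r2 = c(summary(fit_orig)$r.squared, summary(fit_noise)$r.squared),
  loo_r2 = c(loo_r2(fit_orig), loo_r2(fit_noise))
)
```

In-sample R² cannot decrease when predictors are added to a least-squares model, which is why it is a poor guide here. The LOO R² will typically, though not always, fall when the added predictors are noise.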
Variation is Central
In observational sciences like economics and political science, replication can be more indirect—for example, analysing local economic activity within different countries.
Three Reasons to Move Beyond p-values
Discretising based on significance tests throws away information
In real problems, there are no true zeroes—everything that could have an effect does have some effect
Comparisons and effects vary by context, so excluding zero isn’t particularly informative
Focus instead on:
Do graph:
Don’t obsess over:
Rule of Thumb
Any graph you show, be prepared to explain. If you can’t explain why it matters, don’t include it.
Coefficients Are Not “Effects”
A regression coefficient is the modelled average difference in the outcome, comparing two individuals that differ in one predictor while being at the same levels of all other predictors.
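The comparison framing can be checked numerically. This sketch uses the `mtcars` data that ships with R; the model and the two hypothetical cars are chosen purely for illustration.

```r
# mtcars ships with base R; the model choice here is purely illustrative
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars: same horsepower, weights one unit (1000 lbs) apart
car_a <- data.frame(wt = 3, hp = 110)
car_b <- data.frame(wt = 4, hp = 110)

# The modelled difference in mpg between the two cars
pred_diff <- predict(fit, newdata = car_b) - predict(fit, newdata = car_a)

# The difference matches the coefficient on wt
unname(pred_diff)
unname(coef(fit)["wt"])
```

Nothing here says what would happen to a particular car if its weight changed. The coefficient only describes how modelled averages differ between cars that already differ in weight.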
Benefits of this framing:
Payoffs of Simulation
Start Simple, Build Complexity
“It’s rarely a good idea to run the computer overnight fitting a single model. At least, wait until you’ve developed some understanding by fitting many models.”
Practical strategies:
Advantages of Subsetting
Consider transforming variables:
Don’t Assume Causal Interpretation
Don’t set up one large regression to answer multiple causal questions at once—this is rarely appropriate in observational settings.
Apply Methods to Problems You Care About
“You will need this understanding to interpret your findings and catch things that go wrong.”
Key Principles (from Week 8)
“The process of writing is a process of rewriting. The critical task is to get to a first draft as quickly as possible.”
— Alexander (2023), TSwD
| Section | Purpose |
|---|---|
| Title | Tell your story in one line |
| Abstract | 3-5 sentences covering context, methods, findings, implications |
| Introduction | Self-contained overview—give away the punchline |
| Data | Create a “sense of place” for your data |
| Model | Specify and justify your approach |
| Results | What you found (not what it means) |
| Discussion | Implications, limitations, future work |
MRP in One Sentence
MRP uses a regression model to relate survey responses to characteristics, then rebuilds the sample to better match the population.
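The two steps in that sentence can be sketched with simulated data. Note the simplification: a full MRP analysis fits a multilevel model, whereas this example swaps in an ordinary logistic regression with `glm()` so that it runs in base R. The population counts, survey, and support rates are all made up.

```r
set.seed(853)

# Hypothetical population counts for each age-by-sex cell (made-up numbers)
cells <- expand.grid(
  age = c("18-34", "35-54", "55+"),
  sex = c("female", "male"),
  stringsAsFactors = FALSE
)
cells$pop_n <- c(300, 350, 350, 300, 350, 350)

# A biased hypothetical survey: older respondents are heavily over-represented
survey <- data.frame(
  age = sample(c("18-34", "35-54", "55+"), 1000, replace = TRUE,
               prob = c(0.1, 0.3, 0.6)),
  sex = sample(c("female", "male"), 1000, replace = TRUE)
)

# Simulated support is higher among younger respondents
p_support <- c("18-34" = 0.7, "35-54" = 0.5, "55+" = 0.3)[survey$age]
survey$support <- rbinom(1000, size = 1, prob = p_support)

# Step 1 (regression): relate survey responses to respondent characteristics
fit <- glm(support ~ age + sex, data = survey, family = binomial())

# Step 2 (post-stratification): predict each cell, then weight by population size
cells$pred <- predict(fit, newdata = cells, type = "response")
poststratified <- sum(cells$pred * cells$pop_n) / sum(cells$pop_n)

c(raw_survey_mean = mean(survey$support), poststratified = poststratified)
```

Under these simulation settings the population support rate is 0.49, while the expected raw survey mean is only 0.40 because younger respondents are under-sampled. Re-weighting the cell predictions by population counts is what moves the estimate back towards the population value.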
Why use MRP?
A Famous MRP Example
MRP is not magic—the laws of statistics still apply—but it can make biased data useful when applied carefully.
Weeks 1-5: Foundations
Weeks 6-13: Modelling
What Makes Good Data Science?
“We consider data science to be the process of developing and applying a principled, tested, reproducible, end-to-end workflow that focuses on quantitative measures in and of themselves, and as a foundation to explore questions.”
— Alexander (2023), TSwD
To Solidify Foundations:
To Go Deeper:
The Most Important Advice
Write code every day. The only way to get better at data analysis is to do data analysis.
“May You Live in Interesting Times”
“Data science needs to insist on diversity, both in terms of approaches and applications. It is increasingly the most important work in the world, and hegemonic approaches have no place.”
— Alexander (2023), TSwD
Key Takeaways from This Course
Good luck with your assessments and future data science endeavours!